Battle royale games have surged in popularity in recent years. The premise of such games is as follows: players are dropped onto a fictional island and fight to be the last person standing. As they roam around the island, they loot for weapons and items crucial for their survival. Players can choose to join a game as a solo player or with a group of friends (4 players maximum). When playing solo, players are immediately eliminated when they are killed. However, in group play, killed individuals can be revived by their teammates.
We are interested in building a prediction model for the popular battle royale game PUBG (PlayerUnknown’s Battlegrounds). In PUBG, players not only have to worry about getting killed by other players, but they also have to stay within the shrinking “safe zone,” which effectively forces players into contact with each other. Outside of the “safe zone,” players take damage to their health at increasing rates.
Through our analysis, we aim to understand which playing strategies are more successful than others: How aggressive are the playing styles of the winners? Is it better to land in a densely or sparsely populated area? Do players who travel farther on the map tend to place higher or lower? Answers to such questions will be of great interest to the PUBG gaming community.
First, we want to investigate how well we can predict a player’s placement based on their in-game actions. Which actions or statistics are most predictive of placement? Exploring this question can then provide insight into how different playing styles compare. We would like to build a model that not only predicts a player’s game performance accurately but also allows us to draw inferences about whether certain playing styles are more successful.
The data comes from the Kaggle competition. To download the data, join the Kaggle competition and run the shell script download_data.sh.
Note: We will need to provide a direct download link for the TA.
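library(tidyverse)  # read_csv, dplyr verbs, gather, ggplot2
library(janitor)    # clean_names
library(caret)      # createDataPartition
library(corrplot)   # corrplot

data.url <- "https://www.dropbox.com/s/319vkfevkfb6kqt/all.zip?dl=1"
if (!file.exists("./data/pubg.zip")) {
  dir.create("./data", showWarnings = FALSE)
  download.file(data.url, destfile = "./data/pubg.zip", mode = "wb")
  unzip("./data/pubg.zip", exdir = "./data/pubg")
}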
# Warning: Very large datasets. Read 10000 samples before scaling up.
raw_dat <- read_csv("data/pubg/train_V2.csv", n_max = 10000)
test_dat <- read_csv("data/pubg/test_V2.csv")
Each row in the data contains one player’s post-game stats. A description of all data fields is provided in pubg_codebook.csv. We will focus on the solo game mode (match_type is solo, solo-fpp, or normal-solo-fpp). The solo game mode constitutes about 15% of the data. The outcome variable we are trying to predict is win_place_perc.
clean_dat <- raw_dat %>%
  # Clean names
  clean_names() %>%
  # Select single-player data only
  filter(match_type %in% c("solo", "solo-fpp", "normal-solo-fpp")) %>%
  # Remove features that are not relevant to single players
  select(-dbn_os, -assists, -revives, -group_id, -match_type, -team_kills) %>%
  # Change id (the player identifier) and match_id to factors
  mutate(id = as.factor(id), match_id = as.factor(match_id))
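The share of solo-mode rows quoted above can be checked with a one-line proportion. The following is a minimal base R sketch on a made-up vector of match types; the toy values and the names `toy_match_type`, `solo_modes`, and `solo_share` are assumptions for illustration and are not taken from the PUBG data.

```r
# Toy stand-in for the match type column; these values are invented
# for illustration and are NOT taken from the PUBG data.
toy_match_type <- c("solo", "squad-fpp", "duo", "solo-fpp", "squad",
                    "duo-fpp", "squad-fpp", "normal-solo-fpp", "squad", "duo")

# The three match types treated as solo play in this report
solo_modes <- c("solo", "solo-fpp", "normal-solo-fpp")

# Proportion of rows whose match type is one of the solo modes
solo_share <- mean(toy_match_type %in% solo_modes)
solo_share
```

On the real data, the same expression would be applied to the match type column before it is dropped by select().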
We are given a training set and a test set. The outcome variable for the test set will not be released until the Kaggle competition ends on January 30, 2019. Therefore, for the purposes of this project, we use only the provided training set, from which we create our own training and test sets.
# Split into train and test set
train_ind <- createDataPartition(y = clean_dat$win_place_perc, p = 0.8, list = FALSE)
train <- clean_dat %>%
  slice(train_ind)
test <- clean_dat %>%
  slice(-train_ind)
head(train)
# A tibble: 6 x 23
  id    match_id boosts damage_dealt headshot_kills heals kill_place
  <fct> <fct>     <int>        <dbl>          <int> <int>      <int>
1 315c… 6dc8ff8…      0       100                 0     0         45
2 311b… 2926117…      0         8.54              0     0         48
3 b780… 2c30ddf…      1       324.                1     5          5
4 9202… 07948d7…      3       254.                0    12         13
5 4714… bc2faec…      0       137.                0     0         37
6 0ba4… f7cb761…      0       194.                1     1         19
# ... with 16 more variables: kill_points <int>, kills <int>,
#   kill_streaks <int>, longest_kill <dbl>, match_duration <int>,
#   max_place <int>, num_groups <int>, rank_points <int>,
#   ride_distance <dbl>, road_kills <int>, swim_distance <dbl>,
#   vehicle_destroys <int>, walk_distance <dbl>, weapons_acquired <int>,
#   win_points <int>, win_place_perc <dbl>
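Because the 80/20 split is drawn at random, the resulting training and test sets change from run to run unless a seed is fixed first. The sketch below illustrates the idea on simulated data with a simple random split in base R. This is not the stratified sampling that caret's createDataPartition performs, and the seed value and the names `toy`, `toy_train`, and `toy_test` are hypothetical.

```r
set.seed(2019)  # hypothetical seed, chosen only for illustration

# Simulated outcome standing in for win_place_perc (not real data)
toy <- data.frame(win_place_perc = runif(100))

# Simple random 80/20 split; note that this is NOT the stratified
# sampling done by caret::createDataPartition
train_rows <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_rows, , drop = FALSE]
toy_test  <- toy[-train_rows, , drop = FALSE]

# Row counts of the two pieces
c(train = nrow(toy_train), test = nrow(toy_test))
```

Calling set.seed() with the same value before the split would reproduce the same partition on every run.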
We first explored the distribution of each feature by the final finish percentile. Individuals were first classified into 0-19th, 20th-39th, 40th-59th, 60th-79th, 80th-99th, and 100th (winners) percentile finish across all games. Then, we plotted the density of features by these percentiles.
It is important to note that the density plots aggregate by percent finish in a game. Thus, it is possible for one individual to place within the 10th percentile in one game, but then finish in the 90th percentile in another. This individual would contribute to the approximated density for both the 0-19th percentile and the 80th-99th percentile.
train %>%
  mutate(win_place_cat = as.factor(floor(win_place_perc / 2 * 10) * 20)) %>%
  gather("feature", "value", -match_id, -match_duration,
         -id, -win_place_perc, -win_place_cat) %>%
  ggplot(aes(x = value, group = win_place_cat, color = win_place_cat)) +
  facet_wrap(feature ~ ., scales = "free") +
  geom_density() +
  labs(title = "Distribution of Features by Finish Percentile",
       x = "Value of Features", y = "Density", color = "Percentile") +
  scale_color_hue(labels = c("0-19", "20-39", "40-59", "60-79", "80-99", "100")) +
  theme_bw()
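To make the percentile grouping concrete, the binning expression used in the plot can be applied to a few hand-picked values. This is a small worked example; the vector `p` is an assumption chosen to sit on and around the bin edges.

```r
# Example finish percentiles on the 0-1 scale used by win_place_perc
p <- c(0, 0.19, 0.2, 0.5, 0.99, 1)

# Same expression as in the plotting code: five bins of width 0.2,
# plus a separate bin (100) that only p == 1 can reach
bin <- floor(p / 2 * 10) * 20

data.frame(win_place_perc = p, win_place_cat = bin)
```

The values map to bins 0, 0, 20, 40, 80, and 100, so only outright winners fall into the last category.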
This plot has some very interesting features:
kill_place has a bimodal distribution. Some players who finish around the 10th percentile nonetheless have many kills and correspondingly low kill_place values. This may be reflected in kill_points: players with high kill_points may record many kills per game (and so rank highly on kills) but still finish at a low percentile.
Several other features (longest_kill, ride_distance, swim_distance, etc.) may benefit from a log transformation.
Additional plots we might want:
# Use the training set for exploration so the test set stays held out
corr_matrix <- train %>%
  select(-id, -match_id) %>%
  cor()
corrplot(corr_matrix, method = "circle")
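The log transformation considered above for features such as ride_distance could be sketched as follows. The example assumes the feature is non-negative and contains exact zeros, which is why it uses log1p rather than log. The data are simulated and the variable names are illustrative only.

```r
# Simulated non-negative distances with many exact zeros; these values
# are illustrative only and are NOT drawn from the PUBG data
set.seed(1)
ride_distance <- c(rep(0, 50), rexp(50, rate = 1 / 2000))

# log1p(x) computes log(1 + x), which is defined at zero, unlike log(x)
log_ride_distance <- log1p(ride_distance)

summary(ride_distance)
summary(log_ride_distance)
```

Zeros map to zero under log1p, so players who never used a vehicle remain a distinct mass at the origin after the transformation.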
# Data Analysis (Modeling)